The following report looks at a sleep study dataset and inspects the column values, plots the correlation between multiple variables, and takes various random samples.
## Rows: 374
## Columns: 14
## $ Person.ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male"…
## $ Age <int> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29…
## $ Occupation <chr> "Software Engineer", "Doctor", "Doctor", "Sale…
## $ Sleep.Duration <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8, 7…
## $ Quality.of.Sleep <int> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6…
## $ Physical.Activity.Level <int> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, 75…
## $ Stress.Level <int> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8, 8…
## $ BMI.Category <chr> "Overweight", "Normal", "Normal", "Obese", "Ob…
## $ Blood.Pressure <chr> "126/83", "125/80", "125/80", "140/90", "140/9…
## $ Heart.Rate <int> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, 70…
## $ Daily.Steps <int> 4200, 10000, 10000, 3000, 3000, 3000, 3500, 80…
## $ Sleep.Disorder <chr> "None", "None", "None", "Sleep Apnea", "Sleep …
## $ Disorder.Exists <chr> "No", "No", "No", "Yes", "Yes", "Yes", "Yes", …
Males and females are nearly equally represented in the dataset.
## Var1 Freq
## 1 Female 185
## 2 Male 189
There are 11 occupations listed, with the majority of people working as nurses, doctors and engineers. The least represented occupations are managers (1), sales representatives (2), and scientists and software engineers (4 each).
## Var1 Freq
## 1 Accountant 37
## 2 Doctor 71
## 3 Engineer 63
## 4 Lawyer 47
## 5 Manager 1
## 6 Nurse 73
## 7 Sales Representative 2
## 8 Salesperson 32
## 9 Scientist 4
## 10 Software Engineer 4
## 11 Teacher 40
The minimum age is 27, the median is 43, and the maximum is 59. There are no outliers in age. Excluding people at the median age itself, slightly more people are younger than the median than older.
## Min Q1 Median Q3 Max
## 27 35 43 50 59
## [1] "The number of people younger than the median age: 186"
## [1] "The number of people older than the median age: 154"
Most people do not have a sleep disorder, and a similar number of people suffer from Insomnia and Sleep Apnea. It looks like there may be an association between sleep disorders and BMI category. Among the people with no sleep disorder, the large majority have a normal BMI, with about 9% of them (19 of 219) being overweight. However, the majority of people with sleep disorders are overweight, and some are obese.
##
## Normal Obese Overweight
## Insomnia 9 4 64
## None 200 0 19
## Sleep Apnea 7 6 65
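The percentages quoted for this table can be checked by converting the counts to row proportions. The following is a minimal sketch that rebuilds the table by hand from the counts printed above (the original `sleep_vs_BMI` object is created in code that is not shown, so the matrix here is a stand-in) and uses base R's `prop.table()`.

```r
# Counts copied from the Sleep.Disorder x BMI.Category table printed above.
# This matrix is a hand-built stand-in for the report's sleep_vs_BMI object.
sleep_vs_BMI <- matrix(
  c(  9, 4, 64,
    200, 0, 19,
      7, 6, 65),
  nrow = 3, byrow = TRUE,
  dimnames = list(Disorder = c("Insomnia", "None", "Sleep Apnea"),
                  BMI      = c("Normal", "Obese", "Overweight"))
)

# Share of each BMI category within each sleep-disorder group (rows sum to 1)
row_share <- prop.table(sleep_vs_BMI, margin = 1)
print(round(row_share, 3))
```

Reading across the `None` row gives the share of people without a sleep disorder who are overweight (19 of 219).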
df = as.data.frame(sleep_vs_BMI)
chart = plot_ly(df, x = ~Var1, y = ~Freq, color = ~Var2, type = "bar")
chart %>% layout(title = "Comparison of People with Sleep Disorders Grouped by BMI Category",
yaxis = list(title = "Number of People"),
xaxis = list(title = "Sleep Disorder"),
barmode = "stack")
This can be seen more clearly when grouping sleep disorders as existing (Yes) or not existing (No).
disorder_vs_BMI = table(data$Disorder.Exists, data$BMI.Category)
df2 = as.data.frame(disorder_vs_BMI)
chart = plot_ly(df2, x = ~Var1, y = ~Freq, color = ~Var2, type = "bar")
chart %>% layout(title = "Comparison of People with Sleep Disorders Grouped by BMI Category",
yaxis = list(title = "Number of People"),
xaxis = list(title = "Sleep Disorder Exists"),
barmode = "stack")
If the data is grouped by whether a sleep disorder exists and by BMI category, the average hours of sleep suggest that BMI is strongly associated with sleep duration. People with a normal BMI have the highest average sleep duration, even among those who also have a sleep disorder. People who are overweight or obese have lower sleep durations, with the lowest average for people who have a sleep disorder and are overweight.
bmi_and_dis_avgsleep = data |>
group_by(data$Disorder.Exists, data$BMI.Category) |>
summarise(avgsleep = mean(Sleep.Duration)); bmi_and_dis_avgsleep
## # A tibble: 5 × 3
## # Groups: data$Disorder.Exists [2]
## `data$Disorder.Exists` `data$BMI.Category` avgsleep
## <chr> <chr> <dbl>
## 1 No Normal 7.41
## 2 No Overweight 6.8
## 3 Yes Normal 7.09
## 4 Yes Obese 6.96
## 5 Yes Overweight 6.77
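As a consistency check on these group averages, they can be combined with the group sizes to recover the overall mean. The sketch below is illustrative only: the group sizes are taken from the sleep disorder by BMI counts shown earlier (Insomnia and Sleep Apnea added together for "Yes"), the averages are the rounded values printed above, and the data frame `groups` is a name introduced here.

```r
# Group sizes from the earlier Sleep.Disorder x BMI.Category counts;
# avgsleep values are the rounded group means printed in the tibble above.
groups <- data.frame(
  disorder = c("No", "No", "Yes", "Yes", "Yes"),
  bmi      = c("Normal", "Overweight", "Normal", "Obese", "Overweight"),
  n        = c(200, 19, 16, 10, 129),
  avgsleep = c(7.41, 6.80, 7.09, 6.96, 6.77)
)

# Size-weighted mean across all five groups (should be close to the overall mean)
overall <- weighted.mean(groups$avgsleep, w = groups$n)

# Size-weighted mean within each disorder group
by_disorder <- sapply(split(groups, groups$disorder),
                      function(g) weighted.mean(g$avgsleep, g$n))

print(round(overall, 3))
print(round(by_disorder, 3))
```

Because the inputs are rounded, the weighted mean only approximately matches the overall mean of about 7.13 hours reported elsewhere in this report.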
Women get slightly more sleep than men, with an average of 7.23 hours a night compared to 7.04.
## # A tibble: 2 × 2
## `data$Gender` avg_sleep
## <chr> <dbl>
## 1 Female 7.23
## 2 Male 7.04
There may be a correlation between higher physical activity levels and greater sleep duration. Running counter to this pattern are two groupings of women: one with the lowest physical activity and the highest sleep duration, and one with the highest physical activity and the lowest sleep duration.
age_vs_sleep = plot_ly(data = data,
x = ~data$Physical.Activity.Level,
y = ~data$Sleep.Duration,
color = ~data$Gender,
colors = c("#7769f3", "#54bf49"),
type = "scatter",
mode = "markers")
age_vs_sleep %>% layout(title = "Comparison of Physical Activity Level and Sleep Duration",
xaxis = list(title = "Physical Activity Level"),
yaxis = list(title = "Sleep Duration"),
legend = list(title = list(text = "Gender")))
If we replace gender with occupation, we can see that these two groupings are women engineers (low activity, high sleep) and women nurses (high activity, low sleep).
age_vs_sleep2 = plot_ly(data = data,
x = ~data$Physical.Activity.Level,
y = ~data$Sleep.Duration,
color = ~data$Occupation,
type = "scatter",
mode = "markers")
age_vs_sleep2 %>% layout(title = "Comparison of Physical Activity Level and Sleep Duration",
xaxis = list(title = "Physical Activity Level"),
yaxis = list(title = "Sleep Duration"),
legend = list(title = list(text = "Occupation")))
It is unclear from the provided data what distribution physical activity level follows. It could follow a right-skewed distribution, such as an exponential, if the full range of minutes of activity were present. Judging by this chart, we might expect more people to have less than 30 minutes of activity than more than 110 minutes.
activity_dist = plot_ly(x = ~data$Physical.Activity.Level,
type = "histogram",
xbins = list(size = 15))
activity_dist %>% layout(title = "Distribution of Physical Activity Level",
xaxis = list(title = "Minutes of Activity", range = c(20,110)),
yaxis = list(title = "Frequency", range = c(0,90)))
The boxplot of the distribution shows that physical activity level is almost perfectly symmetric between 30 and 90 minutes, with a Q1 of 45 minutes, a median of 60 minutes, and a Q3 of 75 minutes.
## Min Q1 Median Q3 Max
## 30 45 60 75 90
The sleep duration ranges from 5.8 hours to 8.5 hours of sleep, with a mean of 7.13 hours and a standard deviation of 0.796.
##
## 5.8 5.9 6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8
## 2 4 31 25 12 13 9 26 20 5 5 3 19 36 14 5 5 10 24 28
## 7.9 8 8.1 8.2 8.3 8.4 8.5
## 7 13 15 11 5 14 13
## [1] "The mean of the population is 7.132"
## [1] "The standard deviation of the population is 0.796"
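The mean and standard deviation can be reproduced from the frequency table alone by expanding each value by its count. This is a small sketch using only the numbers printed above; the vector names `values`, `counts` and `sleep` are introduced here, and whether the report used the n - 1 or n denominator for the standard deviation is not shown, so `sd()` (n - 1) is an assumption.

```r
# Sleep durations and their frequencies, copied from the table above
values <- c(5.8, 5.9, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9,
            7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8.0, 8.1, 8.2,
            8.3, 8.4, 8.5)
counts <- c(2, 4, 31, 25, 12, 13, 9, 26, 20, 5, 5, 3,
            19, 36, 14, 5, 5, 10, 24, 28, 7, 13, 15, 11,
            5, 14, 13)

# Expand to one entry per person
sleep <- rep(values, times = counts)

print(length(sleep))           # number of people represented by the table
print(round(mean(sleep), 3))   # mean sleep duration
print(round(sd(sleep), 3))     # standard deviation (n - 1 denominator)
```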
Sleep duration does not have a clear distribution. However, based on what we know about sleep, it could be assumed to follow a normal distribution, with frequencies dropping off below 5.5 hours and above 8.5 hours of sleep.
sleep_dur = plot_ly(x = ~data$Sleep.Duration,
type = "histogram",
histnorm = "probability",
xbins = list(size = .3))
sleep_dur %>% layout(title = "Distribution of Sleep Duration",
xaxis = list(title = "Hours of Sleep", range = c(5.5,9)),
yaxis = list(title = "Frequency", range = c(0,.3)))
Drawing 1000 random samples each of 25, 50 and 75 people from the sleep duration data and plotting the sample means, you can see that the histograms become increasingly narrow around the population mean, with increasingly higher frequency near it, as the sample size grows. This is consistent with the Central Limit Theorem.
set.seed(5919)
num_samples = 1000
size1 = 25
size2 = 50
size3 = 75
xmean1 = numeric(num_samples)
for (i in 1:num_samples) {
xmean1[i] <- mean(sample(data$Sleep.Duration, size1, replace = FALSE))
}
plot1 = plot_ly(x = xmean1, type = "histogram", histnorm = "probability")
plot1 %>% layout(title = "Sleep Duration Sample of 1000 of Size 25",
xaxis = list(title = "Hours of Sleep", range = c(5.5, 9)),
yaxis = list(title = "Frequency", range = c(0,.15)))
xmean2 = numeric(num_samples)
for (i in 1:num_samples) {
xmean2[i] <- mean(sample(data$Sleep.Duration, size2, replace = FALSE))
}
plot2 = plot_ly(x = xmean2, type = "histogram", histnorm = "probability")
plot2 %>% layout(title = "Sleep Duration Sample of 1000 of Size 50",
xaxis = list(title = "Hours of Sleep", range = c(5.5, 9)),
yaxis = list(title = "Frequency", range = c(0,.15)))
xmean3 = numeric(num_samples)
for (i in 1:num_samples) {
xmean3[i] <- mean(sample(data$Sleep.Duration, size3, replace = FALSE))
}
plot3 = plot_ly(x = xmean3, type = "histogram", histnorm = "probability")
plot3 %>% layout(title = "Sleep Duration Sample of 1000 of Size 75",
xaxis = list(title = "Hours of Sleep", range = c(5.5, 9)),
yaxis = list(title = "Frequency", range = c(0,.15)))
The original data has 374 rows and a mean of 7.13 hours of sleep.
## [1] "The mean of the population is 7.132"
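The narrowing of the three sample-mean histograms can be compared with the theoretical standard error of the mean. The sketch below is an illustration under stated assumptions: it treats the reported standard deviation of 0.796 as the population value, and it applies a finite population correction because the loops sample without replacement from 374 rows. The names `sigma`, `N`, `sizes` and `se` are introduced here.

```r
sigma <- 0.796          # standard deviation reported above (assumed population value)
N     <- 374            # number of rows in the data
sizes <- c(25, 50, 75)  # the three sample sizes used in the loops

# Standard error of the sample mean when sampling without replacement:
# sigma / sqrt(n), scaled by the finite population correction sqrt((N - n) / (N - 1))
se <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))
names(se) <- paste0("n=", sizes)

print(round(se, 3))
```

The standard error shrinks as the sample size grows, which is the pattern the histograms show. The standard deviation of `xmean1`, `xmean2` and `xmean3` could be compared against these values.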
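Simple random sampling with replacement, drawing 50 people, gives the following frequencies and mean. Because the `a != 0` filter keeps each selected row only once, rows drawn more than once are not repeated, and the table below counts 45 distinct people.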
set.seed(5919)
n = 50
a = srswr(n, N)
rows1 = (1:N)[a!=0]
simple_random = data[a != 0,]
table(simple_random$Sleep.Duration)
##
## 5.9 6 6.1 6.2 6.4 6.5 6.6 6.9 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 8 8.1 8.2 8.4
## 1 2 6 1 3 3 2 1 3 4 1 1 1 1 1 3 2 3 3 1
## 8.5
## 2
## [1] "The mean using simple random sampling is 7.129"
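The difference between keeping distinct rows and keeping every draw can be sketched in base R. This example does not call `srswr` itself: it emulates the kind of count vector that, as I understand it, `sampling::srswr(n, N)` returns (one entry per population unit giving how many times it was drawn) using `sample()` and `tabulate()`. The seed and the names `draws`, `counts`, `rows_distinct` and `rows_all` are introduced here for illustration.

```r
set.seed(1)   # arbitrary seed for this illustration, not the report's seed
N <- 374      # population size
n <- 50       # number of draws

# Draw n row indices with replacement, then count how often each row was drawn.
# counts has length N, like the selection vector used in the code above.
draws  <- sample(N, n, replace = TRUE)
counts <- tabulate(draws, nbins = N)

# Keeping rows where counts != 0 drops repeated draws
rows_distinct <- which(counts != 0)

# Repeating each row index by its count keeps all n draws
rows_all <- rep(seq_len(N), times = counts)

print(length(rows_distinct))
print(length(rows_all))
```

Indexing the data with `rows_all` instead of a `!= 0` filter would give a sample of exactly `n` rows, with repeated people counted each time they were drawn.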
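Systematic sampling of 50 people, taking every 7th row (374 / 50, rounded down) from a random start at row 6, gives the following frequencies and mean. The value 7 printed below is the size of each selection group (the sampling interval), not the number of people sampled.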
## [1] "Sample size of: 7"
## [1] "First person selected in first group: 6"
## [1] "All subsequent rows selected:"
## [1] 6 13 20 27 34 41 48 55 62 69 76 83 90 97 104 111 118 125 132
## [20] 139 146 153 160 167 174 181 188 195 202 209 216 223 230 237 244 251 258 265
## [39] 272 279 286 293 300 307 314 321 328 335 342 349
##
## 5.9 6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 7.1 7.2 7.3 7.4 7.6 7.7 7.8 8.2 8.4 8.5
## 1 4 5 1 3 1 3 2 1 1 1 6 4 1 1 3 5 2 2 3
## [1] "The mean using systematic sampling is 7.068"
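The row numbers listed above follow directly from the interval and the starting row. A minimal sketch, assuming the interval is `floor(N / n)` and fixing the start at the reported value of 6 rather than drawing it at random; the names `k`, `r` and `rows` are introduced here.

```r
N <- 374            # population size
n <- 50             # desired sample size
k <- floor(N / n)   # sampling interval (size of each selection group)
r <- 6              # starting row reported above; in general, sample(k, 1)

# Every k-th row starting from r, n rows in total
rows <- seq(from = r, by = k, length.out = n)

print(k)
print(rows)
```

With these inputs the last selected row is 6 + 49 * 7 = 349, so rows 350 to 374 can never be selected; that is a known property of using a rounded-down interval.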
Stratified sampling by gender gives the following frequencies and mean. Females and males are nearly equally represented, so each stratum contributes the same number of people (25).
mod_data = data.frame(Gender = data$Gender, Sleep = data$Sleep.Duration)
set.seed(5919)
print("Frequencies of gender in data:")
## [1] "Frequencies of gender in data:"
##
## Female Male
## 185 189
## [1] "Strata Sizes:"
##
## Female Male
## 25 25
stratified = sampling::strata(mod_data,
stratanames = c("Gender"),
size = strata_sizes,
method = "srswor",
description = TRUE)
## Stratum 1
##
## Population total and number of selected units: 189 25
## Stratum 2
##
## Population total and number of selected units: 185 25
## Number of strata 2
## Total number of selected units 50
## Sleep Gender ID_unit Prob Stratum
## 29 7.9 Male 29 0.1322751 1
## 30 7.9 Male 30 0.1322751 1
## 36 6.1 Male 36 0.1322751 1
## 38 7.6 Male 38 0.1322751 1
## 40 7.6 Male 40 0.1322751 1
## 57 7.7 Male 57 0.1322751 1
## 65 6.2 Male 65 0.1322751 1
## 80 6.0 Male 80 0.1322751 1
## 90 7.3 Male 90 0.1322751 1
## 110 7.4 Male 110 0.1322751 1
## 147 7.2 Male 147 0.1322751 1
## 148 6.5 Male 148 0.1322751 1
## 166 7.6 Male 166 0.1322751 1
## 168 7.1 Male 168 0.1322751 1
## 169 7.1 Male 169 0.1322751 1
## 175 7.6 Male 175 0.1322751 1
## 177 7.6 Male 177 0.1322751 1
## 184 7.8 Male 184 0.1322751 1
## 199 6.5 Male 199 0.1322751 1
## 212 7.8 Male 212 0.1322751 1
## 217 7.8 Male 217 0.1322751 1
## 223 6.3 Male 223 0.1322751 1
## 224 6.4 Male 224 0.1322751 1
## 248 6.8 Male 248 0.1322751 1
## 278 8.1 Male 278 0.1322751 1
## 33 7.9 Female 33 0.1351351 2
## 70 6.2 Female 70 0.1351351 2
## 95 7.2 Female 95 0.1351351 2
## 101 7.2 Female 101 0.1351351 2
## 124 7.2 Female 124 0.1351351 2
## 141 7.1 Female 141 0.1351351 2
## 150 8.0 Female 150 0.1351351 2
## 189 6.7 Female 189 0.1351351 2
## 229 6.6 Female 229 0.1351351 2
## 246 6.5 Female 246 0.1351351 2
## 259 6.6 Female 259 0.1351351 2
## 269 6.0 Female 269 0.1351351 2
## 288 6.0 Female 288 0.1351351 2
## 294 6.0 Female 294 0.1351351 2
## 304 6.0 Female 304 0.1351351 2
## 318 8.5 Female 318 0.1351351 2
## 319 8.4 Female 319 0.1351351 2
## 321 8.5 Female 321 0.1351351 2
## 331 8.5 Female 331 0.1351351 2
## 335 8.4 Female 335 0.1351351 2
## 339 8.5 Female 339 0.1351351 2
## 346 8.2 Female 346 0.1351351 2
## 355 8.0 Female 355 0.1351351 2
## 370 8.1 Female 370 0.1351351 2
## 371 8.0 Female 371 0.1351351 2
## [1] "The mean using stratified sampling by gender is 7.284"
Considering what was learned earlier about the relationship between sleep disorders and sleep duration, I’ve also used stratified sampling by sleep disorder. There are more people without sleep disorders, so the stratum size for No is larger.
mod_data2 = data.frame(Disorder = data$Disorder.Exists, Sleep = data$Sleep.Duration)
set.seed(5919)
print("Frequencies of sleep disorder existing in data:")
## [1] "Frequencies of sleep disorder existing in data:"
##
## No Yes
## 219 155
## [1] "Strata Sizes:"
##
## No Yes
## 29 21
stratified2 = sampling::strata(mod_data2,
stratanames = c("Disorder"),
size = strata_sizes2,
method = "srswor",
description = TRUE)
## Stratum 1
##
## Population total and number of selected units: 219 29
## Stratum 2
##
## Population total and number of selected units: 155 21
## Number of strata 2
## Total number of selected units 50
## Sleep Disorder ID_unit Prob Stratum
## 36 6.1 No 36 0.1324201 1
## 37 6.1 No 37 0.1324201 1
## 40 7.6 No 40 0.1324201 1
## 42 7.7 No 42 0.1324201 1
## 44 7.8 No 44 0.1324201 1
## 62 6.0 No 62 0.1324201 1
## 71 6.1 No 71 0.1324201 1
## 86 7.2 No 86 0.1324201 1
## 93 7.5 No 93 0.1324201 1
## 107 6.1 No 107 0.1324201 1
## 121 7.2 No 121 0.1324201 1
## 122 7.2 No 122 0.1324201 1
## 123 7.2 No 123 0.1324201 1
## 135 7.3 No 135 0.1324201 1
## 137 7.1 No 137 0.1324201 1
## 138 7.1 No 138 0.1324201 1
## 144 7.1 No 144 0.1324201 1
## 150 8.0 No 150 0.1324201 1
## 157 7.2 No 157 0.1324201 1
## 168 7.1 No 168 0.1324201 1
## 182 7.8 No 182 0.1324201 1
## 206 7.7 No 206 0.1324201 1
## 211 7.7 No 211 0.1324201 1
## 212 7.8 No 212 0.1324201 1
## 317 8.5 No 317 0.1324201 1
## 321 8.5 No 321 0.1324201 1
## 331 8.5 No 331 0.1324201 1
## 337 8.4 No 337 0.1324201 1
## 360 8.1 No 360 0.1324201 1
## 17 6.5 Yes 17 0.1354839 2
## 19 6.5 Yes 19 0.1354839 2
## 68 6.0 Yes 68 0.1354839 2
## 94 7.4 Yes 94 0.1354839 2
## 105 7.2 Yes 105 0.1354839 2
## 190 6.5 Yes 190 0.1354839 2
## 201 6.5 Yes 201 0.1354839 2
## 221 6.6 Yes 221 0.1354839 2
## 228 6.3 Yes 228 0.1354839 2
## 233 6.6 Yes 233 0.1354839 2
## 240 6.4 Yes 240 0.1354839 2
## 251 6.8 Yes 251 0.1354839 2
## 259 6.6 Yes 259 0.1354839 2
## 282 6.1 Yes 282 0.1354839 2
## 287 6.0 Yes 287 0.1354839 2
## 298 6.1 Yes 298 0.1354839 2
## 346 8.2 Yes 346 0.1354839 2
## 347 8.2 Yes 347 0.1354839 2
## 349 8.2 Yes 349 0.1354839 2
## 361 8.2 Yes 361 0.1354839 2
## 373 8.1 Yes 373 0.1354839 2
## [1] "The mean using stratified sampling by sleep disorder is 7.174"
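The strata sizes used in both stratified samples are consistent with proportional allocation, where each stratum receives a share of the 50 people equal to its share of the population. How `strata_sizes` and `strata_sizes2` were actually computed is not shown, so the following is only a sketch of one way to arrive at the same numbers; the helper `allocate` is a hypothetical name.

```r
# Hypothetical helper: proportional allocation of n units across strata.
# Note: rounding does not guarantee the sizes sum to n in general.
allocate <- function(freq, n) round(n * freq / sum(freq))

gender   <- c(Female = 185, Male = 189)   # stratum frequencies from the output above
disorder <- c(No = 219, Yes = 155)

sizes_gender   <- allocate(gender, 50)
sizes_disorder <- allocate(disorder, 50)

print(sizes_gender)
print(sizes_disorder)

# Inclusion probability within each stratum: selected units / stratum size
print(round(sizes_gender / gender, 7))
print(round(sizes_disorder / disorder, 7))
```

The resulting inclusion probabilities can be compared with the `Prob` column in the two stratified sample listings.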
Comparing the mean sleep duration of these four samples to the population mean, systematic sampling gave the lowest mean while stratified sampling by gender gave the highest. The sample mean closest to the population mean came from simple random sampling with replacement, with stratified sampling by sleep disorder the next closest.
mean_comp = c(mean(data$Sleep.Duration),
mean(simple_random$Sleep.Duration),
mean(systematic$Sleep.Duration),
mean(strat_data$Sleep),
mean(strat_data2$Sleep))
names(mean_comp) = c("Population", "SimpleRandom", "Systematic", "Strata_Gender", "Strata_Disorder")
mean_comp
## Population SimpleRandom Systematic Strata_Gender Strata_Disorder
## 7.132086 7.128889 7.068000 7.284000 7.174000
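The ranking described above can be made explicit by computing each sample mean's absolute deviation from the population mean. The sketch below hard-codes the five means printed above so that it runs on its own; `abs_dev` and `ranked` are names introduced here.

```r
# Means copied from the mean_comp output above
mean_comp <- c(Population      = 7.132086,
               SimpleRandom    = 7.128889,
               Systematic      = 7.068000,
               Strata_Gender   = 7.284000,
               Strata_Disorder = 7.174000)

# Absolute distance of each sample mean from the population mean
abs_dev <- abs(mean_comp[-1] - mean_comp["Population"])

# Smallest deviation first
ranked <- sort(abs_dev)
print(round(ranked, 4))
```

Each sample here is a single draw with one seed, so this ranking describes these particular samples and could change with a different seed.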
We have seen in the data that this group of people get 5.8 hours to 8.5 hours of sleep, with a mean of 7.13 hours, and women get more sleep than men. There may be a correlation between higher physical activity levels and greater sleep duration. More data would be needed, as there were two groupings of women that did not fit this correlation.
We’ve seen that having a sleep disorder is associated with higher BMI, and higher BMI is also associated with lower sleep duration. People who get the most sleep have no sleep disorder and a normal BMI.
Despite sleep duration not having a clear distribution in this dataset, the behavior described by the Central Limit Theorem appears when drawing 1000 random samples at each of three sample sizes (25, 50 and 75): the sample means cluster more tightly around the population mean as the sample size grows.
When using different sampling methods, simple random sampling with replacement gave the sample mean closest to the population mean, followed by stratified sampling by sleep disorder.